Design Refinements in MapReduce: Part I
Let's introduce some execution-related improvements to MapReduce's design.
Real-world systems are rarely designed in one go—it often takes many iterations to improve the design. As initial versions of our system are deployed in production, we get usage data and possibly new insights. In this and the next lesson, we will improve many aspects of MapReduce design.
We order our refinements to follow the execution flow of the system.
Input and output types#
Let’s analyze the input and output types supported by the MapReduce library.
Input types#
By default, the MapReduce library supports reading a small set of common input data types. Each input type’s implementation handles splitting the data into meaningful ranges for further processing by the Map tasks.
Example
As we know, the data gets partitioned into key-value pairs before it is processed by the Map tasks. The “text” mode input processes each line as a key-value pair, such that:
- The key is an offset in the input file.
- The value is the content of that line.
This mode ensures that the partitioning happens only at the line boundaries.
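As a rough sketch of how such a "text" reader might behave (the function name and shape are illustrative, not the library's actual API), the reader below yields (byte offset, line) pairs and splits only at line boundaries:

```python
def text_records(path):
    """Yield (byte offset, line content) pairs for a text file.

    A minimal sketch of the "text" input mode: the key is the line's
    starting offset in the file, the value is the line's content, and
    records are split only at line boundaries. The real library also
    handles dividing one large file across several Map tasks.
    """
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            yield offset, line.rstrip(b"\n").decode("utf-8")
            offset += len(line)  # advance by the raw byte length
```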
Support for new input types
Users can also implement a new reader interface to add support for a new input type. For example, we can define a reader that reads records from a database or from a memory-mapped data structure.
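A hypothetical shape for such a reader interface might look like the following (the class and method names are assumptions for illustration, not the library's actual API):

```python
from abc import ABC, abstractmethod


class RecordReader(ABC):
    """Illustrative reader interface for adding a new input type.

    Any implementation yields (key, value) pairs from its source,
    which the library would then hand to the Map tasks.
    """

    @abstractmethod
    def records(self):
        """Yield (key, value) pairs."""


class InMemoryReader(RecordReader):
    """Example implementation: reads pairs out of an in-memory mapping."""

    def __init__(self, table):
        self.table = table

    def records(self):
        yield from self.table.items()
```

A database-backed reader would implement the same `records()` contract, hiding the query logic from the rest of the pipeline.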
Output types#
The MapReduce library also supports various output types by default and, as with input types, lets users define new ones.
Support for custom types is a powerful extension that enables programmers to read and write data from many different sources and sinks.
Partitioning function#
The distribution of the intermediate data to each of the user-defined partitions is handled by the partitioning function. By default, the MapReduce library provides a partitioning function of the form hash(key) mod R, where R is the number of Reduce partitions.
Customization of the partitioning function#
In the real world, the partitioning for many computations can be quite uneven: the hash function sends most of the data to a few buckets, so only a few workers do most of the work, resulting in poor speed-up. To achieve near-linear speed-up, it is critical that each partition gets roughly the same amount of data so that all the available servers can work in parallel.
Instead of using the default partitioning function, hash(key) mod R, such use cases demand partitioning the data by some other function of the output key.
Note: The user can define a customized function to partition the data across partitions of the output file.
Example
An example scenario is one where the output keys are URLs, and we want all the URLs from a single host to land in the same partition of the output file. If we use the default partitioning function, hash(key) mod R, we wouldn’t be able to generate separate files per hostname. To achieve the desired output, we can change the partitioning function to hash(Hostname(urlkey)) mod R.
Let’s assume we have the following URL to a course on the Educative platform: https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers. Instead of taking the hash of this URL, we first resolve its hostname, which is www.educative.io in this case, and then take the hash of that hostname. This way, all the Educative URLs map to a single Reduce partition, instead of being scattered across partitions as they would be if we had hashed the full URLs.
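The two partitioning schemes above can be sketched as follows. This is a minimal illustration, assuming R = 4 partitions and a stable hash (Python's built-in `hash()` is salted per process, so we use `blake2b` instead); the function names are ours, not the library's:

```python
from hashlib import blake2b
from urllib.parse import urlparse

R = 4  # assumed number of Reduce partitions


def _stable_hash(s):
    # A process-independent hash of a string.
    return int.from_bytes(blake2b(s.encode(), digest_size=8).digest(), "big")


def default_partition(key, r=R):
    # The default scheme: hash(key) mod R.
    return _stable_hash(key) % r


def hostname_partition(url, r=R):
    # The custom scheme: hash(Hostname(urlkey)) mod R, so every URL
    # from the same host lands in the same Reduce partition.
    return _stable_hash(urlparse(url).netloc) % r
```

Under `hostname_partition`, every `www.educative.io` URL hashes to the same bucket, whereas `default_partition` spreads them across buckets.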
The Combiner function#
As established earlier, a reducer gets input from all the Map tasks, with each mapper contributing one portion (one bucket out of the R buckets). This input may contain significant repetition: many records sharing the same output key. Sending all these repeated records wastes network bandwidth, so they should be merged before going out on the network. The user can define a custom Combiner function that merges records with the same output key before they are sent to the Reduce function, noticeably saving network bandwidth and speeding up certain MapReduce operations at the reducer.
Note: The Reduce function here is commutative and associative, so adding the Combiner function before it doesn’t affect its result.
Example
Let’s take the example of the word count problem. The Map function produces millions of records of the form <word, 1> (for example, <the, 1>). Instead of sending these records individually to the Reduce function, we can partially merge them by computing a local sum per word and sending that result instead. This avoids burning unnecessary network bandwidth and saves the Reduce function considerable time, since it would otherwise have to merge these records one by one.
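A minimal sketch of this Map function and a Combiner for word count (the function names are illustrative):

```python
from collections import Counter


def map_words(text):
    # Map function for word count: emit a <word, 1> record per word.
    for word in text.split():
        yield word, 1


def combine(pairs):
    # Combiner: locally sum the counts per word so that each Map task
    # sends one record per distinct word instead of one per occurrence.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())
```

For the input `"to be or not to be"`, the Map function emits six records, but the Combiner collapses them to four, one per distinct word, before anything crosses the network.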
Comparison of the Combiner function and the Reduce function#
Functionality-wise, there is no difference between the Combiner function and the Reduce function; typically, we use the same code for both. The only differences are where they run and how the MapReduce library handles their outputs.
- The MapReduce library writes the output of the Combiner function to the intermediate file destined to be sent to a reducer.
- The MapReduce library writes the output of the Reduce function to the final output file.
Guaranteed ordering#
The system needs to distribute the data among workers effectively and logically to engage as many workers as possible in the cluster. A useful by-product of this implementation is that the final output is sorted and can be analyzed accordingly.
The MapReduce library guarantees that, within a partition, the intermediate key-value pairs are processed in increasing key order. This lets reducers write their output in sorted order, which in turn facilitates efficient random-access lookups by key in the final output file.
A sorted order of intermediate keys can also simplify the Reduce function. For example, in word counting, once the run of records for the key the finishes, the Reduce function knows it will not see the key the again and can emit its count immediately. If the intermediate data were not sorted on keys, the Reduce function would have to keep partial sums for every key in memory until it had processed all of its data.
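This streaming pattern can be sketched as follows. It assumes the intermediate pairs arrive already sorted by key, as the library guarantees within a partition; the function name is ours:

```python
from itertools import groupby
from operator import itemgetter


def reduce_sorted(pairs):
    """Reduce over key-sorted intermediate pairs for word count.

    Because the pairs are sorted, each key's records form one
    contiguous run, so we can emit a key's total as soon as its run
    ends; no state is kept beyond the current group.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(v for _, v in group)
```

If the input were unsorted, `groupby` would produce multiple runs per key, which is exactly why the unsorted case forces the reducer to hold all partial sums in memory.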
Side effects#
In addition to the program-generated output files of the Map and Reduce tasks, the user can also produce auxiliary output files (side-effects) from both of these functions, for example, for debugging purposes.
Restrictions on the side-effects#
By default, for such auxiliary output, the writer outputs to a temporary file and renames it once the file has been fully written. The responsibility of making these side-effects atomic and idempotent lies solely with the application writers (users).
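The write-then-rename pattern can be sketched as below; the function name is illustrative, and we rely on the POSIX guarantee that renaming within a filesystem is atomic:

```python
import os
import tempfile


def write_side_effect(path, data):
    """Write an auxiliary output file atomically.

    Writes to a temporary file in the destination directory, then
    renames it over the final name. Since rename is atomic on POSIX,
    readers never observe a half-written file, and repeating the call
    after a task re-execution simply overwrites the same output,
    making the side-effect idempotent.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp, path)  # atomic rename over the final name
```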
The auxiliary file generation process thus follows a write-then-rename pattern: the task writes to a temporary file and atomically renames it into place once it is complete.
Sometimes, users wish to produce multiple output files from a single task. The MapReduce library does not provide atomic two-phase commit in this case. Therefore, tasks that produce several output files and require consistency across them must be deterministic.
The MapReduce library expects the Map and Reduce functions code (provided by the programmers) to be deterministic and idempotent so that potentially repeated executions on the same data produce the same results. If the Map or Reduce code is not idempotent or has side effects that are not idempotent, then the programmer should be aware that the final results might not be valid.
Note: Even with MapReduce’s restricted programming model and the further expectation of idempotent Map and Reduce code, the library is applicable to a wide range of applications.